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Echo state network (ESN) is viewed as a temporal non-orthogonal expansion with pseudo-random parameters. Such 
expansions naturally give rise to regressors of various relevance to a teacher output. We illustrate that often only a certain 
amount of the generated echo-regressors effectively explain the variance of the teacher output and also that sole local 
regularization is not able to provide in-depth information concerning the importance of the generated regressors. The 
importance is therefore determined by a joint calculation of the individual variance contributions and Bayesian relevance 
using locally regularized orthogonal forward regression (LROFR) algorithm. This information can be advantageously 
used in a variety of ways for an in-depth analysis of an ESN structure and its state-space parameters in relation to the 
unknown dynamics of the underlying problem. We present locally regularized linear readout built using LROFR. The 
readout may have a different dimensionality than an ESN model itself, and besides improving robustness and accuracy 
of an ESN it relates the echo-regressors to different features of the training data and may determine what type of an 
additional readout is suitable for a task at hand. Moreover, as flexibility of the linear readout has limitations and might 
sometimes be insufficient for certain tasks, we also present a radial basis function (RBF) readout built using LROFR. It 
is a flexible and parsimonious readout with excellent generalization abilities and is a viable alternative to readouts based 
on a feed- forward neural network (FFNN) or an RBF net built using relevance vector machine (RVM). 

Keywords: Echo State Networks (ESN), local regularization, Bayesian evidence procedure. Orthogonal Forward 
Regression (OFR), variable selection. Radial Basis Function (RBF) 
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1. Introduction 

ESNs are a novel class of recurrent neural networks 
(RNN) [1]. Their easy construction and simple training 
procedure are appealing and have attracted the attention 
of many researchers. The ESN model consists of the state- 
space update equation ([ij and the readout equation ^ 

x(fc + 1) = f (W"u(fc + 1) + Wx(fc) + W^'''y(fc)) (1) 



X 
J3 



y{k + 1) = W°"*[u(fc + 1), x(fc + 1) 



(2) 



where u(fc) is an L-dimensional input vector, y(fc) is an P- 
dimensional output vector and x(fc) is an M-dimensional 
echo-state vector. W™ e 7^*fx^ denotes an input weight 
matrix, W G 7^*^^^ denotes an internal weight matrix 
and W^'' e TZ^'^^^ is a feedback matrix. Vector function 
f is applied element- wise to its arguments. The most com- 
mon choice for f is either a vector of sigmoid or identity 
functions. 

Mathematically, the state-space equation ([T]) represents 
a non-orthogonal temporal expansion of teacher and input 
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signal onto a higher-dimensional space. The expansion is 
carried out so that diverse echoes of an input and teacher 
signal are generated (hence the name echo-state). This 
diversity, which should appropriately " explain" a variance 
of a teacher signal, is the key to the successful training of 
an ESN. Traditionally, the weight matrices W™, W, W-'^'' 
are generated in a pseudo-random manner with no compli- 
cated estimation being involved [T] . Nonlinearities in f are 
usually chosen by a trial-and-error approach ^ [2] . This 
approach usually suffices to generate many diverse signals. 
Signals generated by the expansion are successively used 
as regressors by the linear regression readout mechanism 
with the readout weights W°"* being the only parameters 
to be estimated. 

Although the training procedure is simple and trans- 
parent, constructing an ESN model that generalizes well 
is not straightforward and usually involves a considerable 
number of trials with different ESN parameters. There 
has been a considerable effort to improve the original ESN 
model. Some optimizations can be done before expand- 
ing echo-signals in state-space parameters (mostly in the 
weight matrices and in the vector of update functions f ) 
[31 m [51 ini [7] and some improvements are possible after the 
expansion in the readout equation [5J [HI [IHl [HI [H] ■ 

The motivation for adjusting the various parameters of 
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the state-space equation is to obtain a stable state-space 
update and to generate echo-regressors so that their hnear 
compounds explain the variance of the response variable 
sufficiently well - up to a desired accuracy. Stability of the 
state-space update is of crucial importance but research 
along this line is not targeted here. This study analyzes 
the statistical quality of regressors generated by the ESN 
expansion and shows how results of such analysis can be 
advantageously used in a variety of ways as described later 
in the text. 

The state-space equation generates non-orthogonal echo- 
regressors of various relevance to the response variable. 
Some of the generated echo-regressors explain a more sig- 
nificant part of the response variable variance and some 
less. The number of echo-regressors is often excessively 
high (up to few hundreds or more) with some of the gener- 
ated echo-regressors often being collinear (multicollinear- 
ity) . A high number of echo-regressors and multicollinear- 
ity contributes to the undesired numerical instability of 
the linear readout regression model where a small change 
in regressors causes a large change in the response. Gen- 
eralization abilities of such an unstable model are usually 
low and the model itself might be meaningless. More- 
over, models with many parameters may easily fit into the 
noise of the response variable and are thus prone to over- 
fitting. From the standpoint of regression modeling, it is 
therefore desirable to reduce the number of readout pa- 
rameters, especially by removing (or penalizing) collinear 
and less significant regressors. It is also worth to point 
out that echo-regressors are "loosely" coupled only in the 
state-space equation. The readout mechanism may there- 
fore use selected significant echo-regressors only so that a 
stable readout model with a good generalization perfor- 
mance is obtained. A traditional linear regression readout 
measures the suitability of fit only in terms of mean square 
error (MSE). This is often unsatisfactory because no in- 
formation concerning the quality of the generated echo- 
regressors is given. 

Various improvements addressing the mentioned issues 
have been proposed [111 I12j . Ridge regression combined 
with pruning of output weights was tested in [TT]. Prun- 
ing decreased the undesired effects of multicoUinearity and 
this resulted in an increased performance and better gen- 
eralization. Pruning is however computationally exhaus- 
tive and ridge regression regularizes a least-squares (LS) 
solution only globally using a single regularization param- 
eter which is estimated using the grid search. Performance 
was observed only in terms of MSE which limits further 
analysis. Work in (12j presents a joint estimation of local 
(individual) regularization parameters and delay parame- 
ters in delay&sum (D&S) readout based on a variational 
bayesian approach. The algorithm is computationally ef- 
fective and capable of an accurate estimation of delay pa- 
rameters which vastly improves the memory capacity of 
an ESN and provides a deeper insight into the temporal 
structure of the underlying problem. Joint estimation of 
regularization parameters then further enhances the ro- 



bustness of the model. 

Regularization parameters smooth model response by 
penalizing individual echo-regressors according to their "rel- 
evance" to the smoothed "noise- free" version of the teacher 
signal. This alone while significantly improving generaliza- 
tion does not fully indicate the usefulness of the individ- 
ual echo-regressors. Our work experimentally illustrates 
this phenomenon and we propose to study individual echo- 
regressors using locally regularized orthogonal forward re- 
gression (LROFR) so that both their individual variance 
contributions and their Bayesian relevance is jointly de- 
termined. This can be advantageously used in a variety of 
ways. The appropriate dimensionality for an ESN read- 
out may be determined and the readout may have a lower 
dimensionality than the state-space update itself which 
naturally boosts the robustness and is useful when e.g. 
augmenting echo-states. The effectiveness of an expan- 
sion is transparently determined and this enables an in- 
depth evaluation of ESN state-space parameters and ESN 
structure; e.g. determining the suitability of an additional 
nonlinear readout discussed shortly or other aspects like 
analysis of multiple reservoirs etc. discussed later in the 
text. 

A linear readout in general has, however, limitations in 
terms of the model's flexibility which might sometimes be 
insufficient for a task at hand. LROFR analysis is able to 
determine whether this is the case and suggests whether a 
more flexible readout mechanism should be used. Such a 
readout usually requires less parameters than a linear read- 
out and approximates the response variable with better ac- 
curacy. Feed forward neural networks (FFNN) and RBF 
nets constructed using Relevance Vector Machine (RBF- 
RVM) are perhaps the most popular examples of such 
models ^131 dH US] . Advantages of the both models is that 
they often reduce the number of readout parameters, at- 
tenuate effects of multicoUinearity by using entire internal 
state as an input, and often explain a higher portion of the 
variance of the response variable than linear compounds 
of original echo-regressors in a linear readout Train- 
ing of an FFNN requires a nonlinear least-squares iterative 
search algorithm, based on gradients, which has a consid- 
erably high computational cost. The extreme flexibility 
of this model requires cautious training to prevent over- 
fitting and obtaining a meaningful model is not straight- 
forward. This is contrary to the fast and appealing LS 
estimation of parameters in a traditional linear readout 
of an ESN. Kernel support vector machines such as RBF- 
RVM received considerable attention and popularity over 
the last decade and may also be used as a readout mech- 
anism for an ESN. In comparison to FFNN, RBF-RVM is 
easier to train and possesses some computational advan- 
tages. Recently, however, it was shown that RBF-RVM 
suffers numerical instability when dealing with complex 
and highly nonlinear data and that its abilities have per- 
haps been overstated [15 . 

Locally regularized RBF readout where RBF net is 
constructed using LROFR with D-Optimality cost (RBF- 
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LROFR-DOPT) is presented for cases where LROFR anal- 
ysis indicates that a more flexible readout should be used. 
This readout has an excellent generalization performance 
and possesses computational advantages to readouts based 
on FFNN/RBF-RVM [15]. 

The paper is organized as follows. Variable selection 
that is used to analyze the individual echo-regressors is 
briefly presented first. Next, locally regularized RBF mod- 
els are explained in Section |3] Analysis of an ESN using 
the presented variable selection mechanisms is then pro- 
vided in Section |4] and results in a proposal of locally reg- 
ularized linear readout. Locally regularized RBF readout 
is subsequently presented as a viable alternative to read- 
outs based on FFNN/RBF-RVM. Section [s] illustrates our 
modeling strategy using a real-world noisy task and a syn- 
thetic noiseless task. Future directions for research are 
suggested in the discussion of Section [6] Finally, the mer- 
its of the proposed modeling strategy are summarized and 
new prospective applications are outlined in the conclu- 



2. Variable Selection 

2.1. Orthogonal Forward Regression 

The Orthogonal Forward Regression (OFR) algorithm 
uses the advantages of orthogonal decomposition of the 
design matrix to compute the individual contribution of 
each regressor to the response variable variance |16j . 

The regression model is to be built from N-sample data 
set {x(fc), y(A;)}^^ where x(fc) G TZ^^ and y{k) £ TZ are 
the fc-th training input vector and corresponding desired 
scalar response, respectively. The regression model may 
be expressed in the matrix notation 



X/3- 



(3) 



where y = . . . , y(iV)]^ 



[e(l),...,e(7V)F, 



[Pi, ... , Pm]'^ , and X is the design matrix of size N x M 
with its rows x(l), . . . ,x(7V). Orthogonal decomposition 
of the design matrix X may be written as follows 



where 



X QR 



Q = [qi, • • ■,^M] 



(4) 



(5) 



is a design matrix with orthogonal columns satisfying 
q^q^ = if i 7^ and R is an upper triangular matrix. 



R 



1 ri^2 
1 

••• 



1 



(6) 



The regression model can be then expressed using the or- 
thogonal design matrix Q. 



Minimizing standard least-squares error criterion J = e^e 
(i.e. setting @J/@g=0) yields a vector of regression coef- 
ficients g = (Q"^Q)~^Q"^y which satisfies the triangular 
system R/3 — g. Solving the triangular system obtains the 
original regression coefficients /3. Although computing the 
regression coefficients in such a way has advantages [T7], 
the algorithm here is concerned with a variable selection 
in a forward regression manner using the advantages of 
orthogonality. 

Some useful transformations may be carried out with 
the orthogonal regression model as follows. 

y = 5iqi H \- 9ai(Im + e (8) 

Orthogonality of the regressors q^ allows us to write 

y^y = .9?qf qi + ■ • • + glinJinM + e^e (9) 

If mean of the response variable y is then its variance 
equals to 

^"V^y = N-\glqlm + ■■■+ g'WMqM) + N'^e^e 

(10) 

The variance of y is therefore expressed by the variance 
explained by the model (the regressors) and unexplained 
variance of the error term. Because the regressors do not 
interact (they are orthogonal), it is possible to compute 
their individual variance contributions to y. Each indi- 
vidual regressor increases the variance explained by the 
model and reduces the unexplained variance of the error 
term. This error reduction ratio for an i-th single regressor 
can be expressed as follows. 



err, = N-^gfqfq,/N-'y'y = gM^h/y'v (H) 



2„T, 



The variance equation ( 10 1 may be then alternatively ex- 
pressed as follows 



M 

1 = erri 

i=l 



r I 'J- 

e/y y 



(12) 



y = Qg + e 



(7) 



and the unexplained variance ratio then simply is 

M 

eVy^y = l-I]em. (13) 

The algorithm builds a sub-model by selecting Mguh signif- 
icant regressors (usually M^uh ^ A^) in a forward regres- 
sion manner based on the error reduction ratio. Selection 
is terminated when user-specified tolerance < f < 1 for 
the unexplained variance ratio is reached. 

1 - < ^ (14) 

1=1 

Alternatively, all available regressors are gradually selected 
and the unexplained variance ratio is observed after each 
selection. It is often the case that after selecting a cer- 
tain number of significant regressors, introducing more re- 
gressors causes the unexplained variance ratio to decrease 
only marginally. Such regressors often contribute to ill- 
conditioning and the corrupt overall statistical quality of 
the model. 
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2.2. Locally Regularized OFR 

Regression models built using OFR may still fit into 
the noise of the response (overfitting) because selection is 
based purely on minimization of error. Local regulariza- 
tion appropriately smoothes model response by penalizing 
regressor terms to prevent overfitting. Locally regularized 
OFR (LROFR) UHl [H] adopts the following regularized 
error criterion 



T 

e e 



E 



T 

e e 



g Ag 



(15) 



where A = diag{\i, . . . , Am} and \i is an i-th regulariza- 
tion parameter. As in the case of the OFR, Eq. ([9]), it can 
be shown [TS] that the orthogonal regression model may 
be written as 



T 

y y 



M 

E 



T 

e e 



g Ag 



(16) 



and the error criterion can be thus expressed as 

M 

e^e + g^Ag ^ y^y - ^ g2(qTq^ ^ (^7) 

i-l 



Similarly to the Eq. (13), normalizing (17) by y y gives 

M 

(e^e + g^Ag)/y^y = 1 - J] fff (qfq, + A,)/y^y. (18) 

i-l 

Analogously to the OFR, the regularized error reduction 
ratio due to is then 



rerr, = gj{q^q.^ + A,)/y' y. 



(19) 



Significant regressors are selected based on rerr criterion 
in a forward regression manner |18j . Selection finishes 
when user-specified tolerance < ^ < 1 is reached. 



E 



rerri < ^ 



(20) 



This produces a sparse model of Mg^h M regressors. 

Regularization parameters Ai are initially unknown. 
They are all set to a small value (e.g. 0.01) and a pass 
of an OFR using rerr criterion is carried out over the ini- 
tial full model. A; of a resulting sub-model is updated 
using the Bayesian evidence procedure [T^ and an OFR 
using rerr criterion with the updated A^ is carried out over 
the previously generated sub-model. This procedure is ap- 
plied iteratively until \i remains sufficiently unchanged be- 
tween two iterations. In general, each iteration generates 
a sub-model from a model generated in the previous itera- 
tion and updates A^ used in a next iteration. This can be 
schematically expressed as follows. 



X(2) 



where Mf^ii > -^2 > • • • > M final ■ The algorithm is com- 
putationally effective and the number of regressors often 
decreases dramatically within the first few (e.g. 4-5) it- 
erations. A few more iterations are usually required for 
regularization parameters to converge. Typically about 10 
iterations in total suffice to construct final parsimonious 
model with the design matrix 'X.-^™"-^ . 

2.3. Locally Regularized OFR with D-Optimality Cost 

LROFR algorithm can be further enhanced by the D- 
Optimality cost which effectively maximizes determinant 
of (X-^X) and thus improves model robustness [18]. LROFR 
with D-Optimality cost adopts the combined criterion 

M 

JcRis, A, 13) = Jfl(g, A) + /3 ^ ~log{qfq,). (22) 

1=1 

The algorithm is identical to the LROFR but the selection 
is governed by the combined regularized error reduction 
ratio. 

crerr, = ^^((qf q, -|- A,) -|- l3log {q[q,))/y^y (23) 



Stopping rule (20) for a single iteration is not necessary 



any more because after selecting a certain number of sig- 
nificant regressors all remaining unselected regressors will 
have their crerri < 0. Finding an appropriate value of the 
user-specified parameter /3 is usually quick and straight- 
forward. 



3. RBF Regression Modeling 

The aim in nonlinear system identification is to ap- 
proximate (identify) dynamics between observed inputs 
and outputs of interest. The discrete-time system to be 
identified is in the following form 

y{k) = /(u(fc), ...,u{n- lu);y{k- 1), ...,y{k - ly)) + e{k); 

(24) 

where u(k) is the input vector and y{k) is the scalaiF] out- 
put for the time step fc, 1^ and ly are the time delay lags in 
the input and output respectively and e(fc) is the system 
white noise. Letting 

x(fc) = [u(fc), u(n - lu);y{k - 1), y(fc - ly)] (25) 

reduces ( pi] ) into 

y{k)^f{^ik)) + e{k); (26) 

The system is to be identified from N-sample data set 
{x(fc), ?/(A:)}^-^. One of the possible approaches to ap- 
proximate / is the RBF regression modeling. 

M 

y{k) = m + e{k) = eM^{k)) + e{k), (27) 

i=l 



(21) 



^Extension to multi-dimensional vector output is straightforward, 
scalar notation is used for sake of simplicity. 
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where M denotes the number of RBF regressors, 9i is the 
i-th regression coefficient and <j)i{x{k)) is its corresponding 
i-th RBF regressor with centre and variance Vi. 



(/.,(x(fc)) =^(||x(fc)-c,||/u,) 



(28) 



1 1 . 1 1 is the Euchdean norm and function ip is an RBF non- 
hnearity. The thin plate spline (TPS) function (Eq. 29) 



and the Gaussian function (Eq. 30 1 are the two most com- 
mon choices. 

M^) = X^logix) (29) 



2/2 



(30) 



The regression model of Eq. ( 27 ) can be written in a 
matrix form 

*6»-fe (31) 

[e{l),...,e{N)f, e = 
[<Pi, . . . , 4>m]'^ is the design matrix 

: [0,(X(1)),...,0,(X(7V))],1 < i < 



[2/(1),... 
1^ and * 



y 

y{N)f, e 



where y 
[6i,. . . , 9m\ 
with its columns (j)^ 
M. 

The crucial issue in the RBF modeling is to choose a 
set of basis functions ipi so that resulting model is robust, 
parsimonious and generalizes well. A common approach is 
to generate many basis functions first, and then iteratively 
select a suitable subset until a final parsimonious model is 
generated. Each generated (pi is defined by its centre 
and variance Vi. The variance Vi is often fixed to a single 
value V for each Let Cj = x(i), 1 < i < M, and set 
M = N which effectively makes every data point x(i) a 
candidate for a centre. This makes the number of gener- 
ated basis functions 4>i equal to the number of data points 
and matrix €> will result in size oi N x N. Let the design 
matrix of this initial full model be denoted by ^.^''^ and 
the design matrix of the final model by LROFR 
algorithm combined with D-Optimality cost is used to iter- 
atively select useful regressors from (iV x Mfuii) into 

^f^nal ^ M/w) whcrC Mf.nal « Mf^ll [HI |I5] . In 

each iteration, the algorithm selects a subset of regressors 
from the previous iteration in a forward regression man- 
ner and updates the regularization parameters used in the 
next iteration. 

4. ESN Regression Modeling 

4-.1. Linear Readout 

The state-space update equation Q is used to extract 
temporal or spatial patterns from original N-sample data 
set {u(A:),y(fc)}^;^ where u(fc) e 72.^ and y{k) € 'RP are 
the fc-th training input vector and corresponding desired 
vector response, respectively. Echo-states x(fc -I- 1) are 
sampled via the Eq. ([T]) using the original input vectors 
\i{k + 1) and/or output vectors y(A;), and previous echo- 
state x(fc) itself. Some initial echo-states from the sam- 
pling are discarded to avoid the infiuence of the undefined 
random states x(0) and y(0) at time fc = 0. The rest of 



the echo-states are then stored as rows of the design ma- 
trix X. Let rows of the matrix y be the vectors y{k)^. 
The ESN regression readout model can be then expressed 
in the matrix form 

y = XW°"* + e (32) 

where W°"* are the regression weights (coefficients) and e 
are random errors with common variance and zero mean. 
The regression weights W°"* are traditionally estimated 
using the least-squares (LS) method 



(X^X)-iX^y. 



(33) 



Sampling via the state-space Eq. ([!]) generates echo- 
regressors (columns of the design matrix X) of various 
relevance to the desired response. The shape and charac- 
teristics of the echo-regressors are indirectly governed by 
the ESN state-space parameters (pseudo-random weight 
matrices W™, W, W.'''', nonlinearity f, etc.). These pa- 
rameters are adjusted (mostly using intuition and experi- 
ence) so that the linear compounds of the generated echo- 
regressors approximate the response with a desired accu- 
racy. 

Obviously, an expansion with pseudo-random parame- 
ters gives rise to regressors of various importance. As out- 
lined in the introduction, the number of echo-regressors 
is often excessively high with some of the echo-regressors 
being coUinear. High dimensionality of the readout and 
multicoUinearity contributes to numerical instability of the 
readout and it is often the case that LS estimates of W°"* 
have large variances and inappropriately large mean val- 
ues. Models with many parameters are also prone to over- 
fitting because they may easily fit into the noise of the 
response. Taking these issues into account, it is desir- 
able to reduce dimensionality of the readout, particularly 
by removing or penalizing the undesired coUinear and less 
significant echo-regressors. 

These issues may be addressed by local regularization 
applied to all the generated echo-regressors [12]. Regular- 
ization penalizes individual echo-regressors based on their 
Bayesian relevance so that the readout response is smooth 
and does not fit the into noise. This alone while signifi- 
cantly improving generalization does not fully indicate the 
importance of the individual echo-regressors. Examina- 
tion of the regularization parameters will only show which 
regressors had to be attenuated so that model response 
is "smooth enough". Such analysis is therefore unsatis- 
factory for relating particular echo-regressors and overall 
ESN structure to different features of the training data. 

To overcome this limitation we study the individual 
quality of the echo-regressors using LROFR. Joint analy- 
sis of the individual variance contributions and Bayesian 
relevance of the echo-regressors using LROFR is able to 
give more accurate information concerning the importance 
of particular regressors. This information can be used in a 
variety of ways for an in-depth analysis of an ESN struc- 
ture and its state-space parameters in a relation to the 
unknown dynamics of the underlying problem. 
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In the following we will describe how to compute the 
individual significance of echo-regressors by using LROFR 
algorithm. 

4-2. Analysis of echo-regressors using the OFR algorithm 

The individual variance contributions of the echo-regressors 
can be calculated using the OFR algorithm given in Sec- 
tion |2] The stopping rule ( 14 1 is left out and all gener- 
ated echo-regressors are gradually selected into the read- 
out model based on their respective error reduction ratios. 
The unexplained variance ratio (variance ratio of the er- 
ror term) is observed after each selection to see by how 
much an introduction of a new echo-regressor decreases 
the ratio. Some of the generated echo-regressors explain 
a higher portion of the response variance and some less. 
It is therefore often the case that in the beginning of the 
selection process the unexplained variance ratio decreases 
sharply while selecting more significant regressors into the 
readout model. Gradually, the unexplained variance ratio 
starts to decrease substantially slower when the algorithm 
selects from the remaining echo-regressors that are either 
non-significant or collinear with any of the already selected 
echo-regressors. Observing this selection process reveals 
the redundancy of an ESN expansion; i.e. roughly how 
many of the echo-regressors are important and how many 
are not so. This information may be used to determine 
the appropriate dimensionality for the linear readout so 
that a parsimonious and more stable readout is obtained. 
Moreover, it can in turn be used to suggest the dimension- 
ality of the state-space equation and it helps to evaluate 
suitability of its pseudo-random parameters too. 

This analysis is also able to determine whether more 
flexibility is required either in readout response surface or 
in readout temporal processing capability. If the analysis 
indicates a need for a more flexible response surface, one 
may use RBF readout built using LROFR-DOPT which is 
presented later. D&S readout may be used in cases where 
a need for more temporal capability is indicated. 

4-3. Locally regularized linear readout 

Analysis using OFR shows how many and which of the 
generated echo-regressors are useful for modeling a task of 
interest. The principle of selecting echo-regressors which 
maximize variance explained by linear readout model alone 
is however often unsatisfactory because the constructed 
readout model may fit into the noise of the response rather 
than into the true response curve itself. Local regulariza- 
tion appropriately smooths the solution obtained by OFR 
so that model response does not (over) fit into noise. Regu- 
larization also naturally attenuates instabilities caused by 
multicollincarity and improves model robustness. 

Locally regularized linear readout uses LROFR to es- 
timate W°"*. The stopping rule (20) is again left out so 
that all echo-regressors in X are analyzed. Computation 
of regularization parameters Xi usually converges within 
8-10 iterations and echo-regressors in X are then ordered 



based on their importance to the response variable of in- 
terest. The very first iteration can actually be used for 
observing the unexplained variance ratio of the OFR anal- 
ysis because regularization parameters Xi are set ,in the 
beginning, uniformly to extremely small values. 

Observing the evolution of the unexplained variance ra- 
tio in the first iteration of LROFR, and examining vector A 
with order of regressors in X give richer and more profoimd 
information concerning the importance of echo-regressors 
and suitability of an ESN structure with its state-space 
parameters in a relation to the unknown dynamics of the 
underlying problem. 

One may change the state-space parameters of an ESN 
(W™, W, W'^'', nonlinearity f , dimensionality, etc.), carry 
out LROFR, and observe how different parameters influ- 
ence the quality and redundancy of echo-regressors. Such 
analysis provides in-depth information concerning how an 
ESN reacts to the data of interest and may be used for re- 
lating particular echo-regressors and overall ESN structure 
to different features of the training data. It is important 
to point out that sole regularization is not able to provide 
as deep an insight into the efficiency of an ESN expansion 
as a joint analysis of variance contributions and Bayesian 
relevance using LROFR. This will be experimentally illus- 
trated in Section [5] 

Vector of output weights W°"* estimated by LROFR 
is then re-ordered as was the original order of regressors 
in X for use with the state-space equation. Alternatively, 
the unnecessary echo-regressors may be completely left out 
which effectively reduces the dimensionality of the read- 
out and is advantageous when further augmenting original 
echo-regressors by e.g. second power. 

4.4- Locally regularized RBF readout 

Flexibility of a linear readout might sometimes be in- 
sufficient for the data under consideration. Lack of read- 
out flexibility is indicated by an unsatisfactory MSE which 
indicates that the original regressors do not explain the 
response variable sufficiently well and that a transforma- 
tion of the original regressors should be carried out. MSE 
alone, however, does not reveal much information concern- 
ing the entire ESN model. Analysis using LROFR shows 
how the variance ratio of the error term (or MSE itself) 
drops with each selection of a new echo-regressor. If the 
state-space equation generates echo-regressors of low rele- 
vance to the response, the unexplained variance ratio will 
drop a little with each selection of a new regressor. This 
indicates that a more flexible readout will probably explain 
the response variable with higher accuracy while requiring 
less parameters. 

Locally regularized RBF readout built using LROFR- 
DOPT is a flexible nonlinear modeling strategy that gen- 
erates parsimonious models with excellent generalization 
performance. It is numerically more stable than RBF- 
RVM and possesses the computational advantages of non- 
linear models linear-in-parameters. Locally regularized 
RBF readout for an ESN is constructed as described in 
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Section[3]but vector x(A;) in ( pS] ) is combined using u(A;), y(fc- 
1) via the state-space equation ([T]); i.e. x(fc) equals to a 
fc-th row of the ESN design matrix X. The readout may 
be expressed as 



y{k + 1) = fout^^out^fUBl' (^(^ ^ 



(34) 



where f^^^ = [0i(x(A: + 1)), . . . , 0M(x(fc + 1))] and M 

equals to M/,„„, after apphcation of LROFR with D-Optimahtj^j^^g b^^^^^^ ^^^^ elements of the matrix W 
cost to the full RBF model. 



5. Modeling Examples 

The following tasks are used to illustrate the analysis 
of an ESN model using LROFR algorithms. Joint orthog- 
onal variance analysis combined with Bayesian relevance 
in LROFR is experimentally tested. Locally regularized 
RBF readout is additionally used where appropriate if the 
analysis indicated a need for a more flexible readout. The 
first presented task is a noisy real-world application and 
the second task is a synthetic noiseless benchmark task. 

5.1. Synthesis of Handwritten Characters 

An ESN was successfully used to investigate and model 
the naturalness of handwritten characters which is defined 
as being the difference between the strokes of the hand- 
written characters and the canonical fonts on which they 
are based |19j . In general, the naturalness learning frame- 
work defines naturalness as the difference between target 
human-like behavior and the behavior of an original canon- 
ical system which resembles the desired target human-like 
behavior but lacks human idiosyncrasy. In the handwrit- 
ing example, the canonical system comprises the strokes of 
an original font character and the naturalness of the differ- 
ences between handwritten strokes and their correspond- 
ing original font strokes. If it were possible to generate the 
appropriate naturalness (differences) for the strokes of a 
font character, then synthesizing a handwritten character 
would be reduced to the simple addition of the naturalness 
to the original font strokes. 

It was shown that the relationship between certain 
properties of font strokes (canonical system) and their nat- 
uralness exists and can be modeled using an ESN [THl [20] . 

Font strokes were expressed as a 2D vector field com- 
prised of vectors between successive evenly spaced points 
of font strokes. Naturalness was expressed as a 2D dis- 
placement vector field between handwritten strokes and 
their corresponding original font strokes. In other words, 
an ESN is used to model a relationship between a 2D ex- 
planatory variable (font strokes) and a 2D response (nat- 
uralness). Following the original work [TS^, the training 
data matrices U (2D explanatory variable) and V (2D re- 
sponse) were established at sizes 2704 x 2. 

The network is comprised of 300 units using the Gaus- 
sian RBF activation function with zero mean and vari- 
ance of 1. The internal weights in the matrix W were 



randomly assigned values of 0, 0.2073, —0.2073 with re- 
spective probabilities 0.95, 0.025, 0.025. For a 300 x 300 
matrix W, this implies a spectral radius of « 0.85, provid- 
ing for a relatively long short-term memory [2] . Two input 
and two output units were attached. Input weights were 
randomly drawn from a uniform distribution over (—1,1) 
with probability of 0.9 or set to 0. With this input ma- 
trix, the network is strongly driven by the input activa- 

are non- 
zero values. The network has output feedback connections 
W'"^'=^ which were randomly set to one of the three values 
of 0, 0.1, —0.1 with respective probabilities 0.9, 0.05, 0.05. 
With this configuration for the feedback connections, the 
network is only marginally excited by previous output ac- 
tivations; using higher values for the feedback connections 
was found to lead to the network becoming unstable when 
running on its own |19j . 

The ESN was driven by the samples in U and V with 
a washout period being set to 300. The first 300 internal 
states were discarded. The network internal states x(fc) 
from fc = 301 through k — 2704 were collected and saved 
as rows of the design matrix X. 

First, the traditional linear readout is used and its per- 
formance is observed in the training, testing and modeling 
stages (Tab. [l|. A relatively low MSE in training is 
in contrast to a higher MSE in the testing and modeling 
stages. Otherwise, not much can be seen from an analy- 
sis using sole MSE. LROFR is then carried out to analyze 
the statistical quality of the generated echo-regressors in 
X against the response variable. Figures [l] and [2] plot the 
results of the OFR analysis (first iteration of LROFR). 
The unexplained variance ratio (i.e. variance ratio of the 
error term) decreases sharply and steadily until about 50 
echo-regressors were selected into the readout. Selecting 
more echo-regressors provided a moderate but still steady 
drop in the unexplained variance ratio until about 150 
echo-regressors were selected. Selecting even more echo- 
regressors gave a rather negligible drop in the unexplained 
variance ratio, especially the region when 180 to 300 echo- 
regressors were selected. Clearly, about half of the echo- 
regressors contribute very little to the variance of the re- 
sponse variable. 

After convergence of the regularization parameters Aj 
(10 iterations), vector A and corresponding regression weights 
were examined. About the last 60 echo-regressors had 
an extremely large which in turn caused their corre- 
sponding regression weights to became virtually zero (i.e. 
extremely small). The OFR analysis (first iteration of 
LROFR) however showed that about half (150) of the 
generated echo-regressors contribute little to the explained 
variance of the training response. This shows that an anal- 
ysis based purely on regularization (Bayesian relevance) is 



^Training stage uses teacher forcing of the training data. Testing 
stage uses the same training data but the ESN runs on its own. Mod- 
eling stage uses also inputs (characters) which were not introduced 
during the training stage and the ESN runs on its own. 



7 




50 100 150 200 250 300 

number of echo-regressors 



Figure 1: OFR (first iteration of LROFR) using the echo regressors 
against the x component of the response variable. 
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Figure 2: OFR (first iteration of LROFR) using the echo regressors 
against the y component of the response variable. 

not able to fully reveal the importance of the generated 
echo-regressors. Joint analysis of the individual variance 
contributions and Bayesian relevance using LROFR gives 
more accurate information concerning the importance of 
the regressors. 

The regression weights W°"* were then reordered as 
was the original order of the echo-regressors in X so that 
the readout could be attached to the state-space equation 
in the testing and modeling stage. Locally regularizing the 
linear readout (i.e. attenuating the undesired regressors) 
increases the training MSE but improves stability and ro- 
bustness of the readout, and thus, decreases the MSE in 
the testing and modeling stage (see Tab. [l] for details). 

Information that about half of the regressors are of 
little importance indicates that perhaps a more flexible 
readout with less parameters may explain the variance of 
the response variable with higher accuracy and better ro- 
bustness. A locally regularized RBF readout was therefore 
tried to see whether it could improve performance. The 
variance of the Gaussian RBF was set u = 65 and D- 



Table 1: Mean square errors and correlation coefficients for all three 
readouts. Readout type: #1 - linear, #2 - locally regularized linear, 
#3 - locally regularized RBF. 



readout 


training 


niSCx.y 






9.36 X 10-'' 


6.74 X 10-4 


0.9758 


0.9733 


it'-' 


9.59 X 10-4 


6.91 X 10-4 


0.9752 


0.9727 


#3 


1.08 X 10-'^ 


1.36 X 10-3 


0.9719 


0.9457 




testing 




c 


#1 


1.53 X 10-2 


1.86 X 10-2 


0.6086 


0.3776 


#2 


1.38 X 10-2 


1.49 X 10-2 


0.6299 


0.4175 


#3 


8.40 X 10-3 


8.03 X 10-3 


0.7580 


0.6316 




modeling 






#1 


2.18 X 10-2 


2.65 X 10-2 


0.4372 


0.2405 


#2 


1.83 X 10-2 


1.97 X 10-2 


0.4646 


0.2543 


#3 


9.08 X 10-3 


9.29 X 10-3 


0.6619 


0.4874 



Optimality weig hting /3 = 10-4. The final RBF readout 
for the X component of the response had 142 RBFs and 
for the y component of the response 124 RBFs. Results 
are given in Tab. [l] It is evident that the locally regular- 
ized RBF readout is superior in the testing and modeling 
stages. Interestingly, the number of the RBF nodes is in 
line with the results of the LROFR analysis given in Fig. [T] 
and Fig. [2] The figures show that only about 130-150 out 
of the total 300 regressors are significant. This suggests 
that an additional nonlinear flexible readout may explain 
the response variance with about the same or lower di- 
mensionality while providing improved accuracy and ro- 
bustness. Dimensionality of the RBF readouts was found 
to be similar to the number of significant echo-regressors 
and fiexibility of the nonlinear RBF response surface re- 
sulted in an improved performance as outlined in Tab. [l] 
The LROFR analysis also suggests that the parameters 
of the state-space equation in the handwriting task could 
be designed and optimized in a better way (e.g. different 
dimensionality, nonlinearity in f, weight matrices, etc.) so 
that less collinear and non-significant echo-regressors are 
generated. Although we are here concentrating on an im- 
provement of the readout mechanism itself, the presented 
analysis is useful for an improvement and better under- 
standing of an entire ESN model structure and is discussed 
further in Section |6l 

5.2. Mackey- Glass system 

The Mackey-Glass (MG) delay differential equation is 
a standard benchmark task for time-series modeling. 

^W=a^f|^-7.W (35) 

The system parameters are usually set to a = 0.2, /3 — 10, 
and 7 = 0.1. If t > 16.8 then the system has a chaotic 
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attractor. The values of r are often set to 17 (mild chaotic 
behaviour) or 30 (wild chaotic behaviour). In our experi- 
ments we set T — 30. We followed the work in [I] and the 
MG system was approximated in discrete time 

yik + l)=yik) + Si0.2^^^^^^I^-0Mk)) (36) 

with S which denotes a stepsize being set to (5 = 10 as 
in [Ij so that results may be compared. A 3000 point 
MG sequence was than generated and used in the training 
stage. Figure |3] depicts the last 1000 points of the training 
sequence. The training sequence was transformed y — > 
tanh{y— 1) to fit into interval (—1, 1) so that the sequence 
may be used with sigmoid internal units. 



1.5 




)i , , 1 1 1 , , , , 1 

100 200 300 400 500 600 700 800 900 1000 



Figure 3: A sequence of the Mackay-Glass system with r = 30. 

An ESN used for modeling the MG system comprised of 
400 internal units with standard sigmoid activation func- 
tion (tanh hyperbolic tangens). The internal weights in 
the matrix W were randomly assigned values 0, 0.4, —0.4 
with probabilities 0.99, 0.005, 0.005 respectively. One in- 
put unit was attached to feed in a constant bias of value of 
0.2. The input weights W™ were randomly set to values 
0,0.14,-0.14 with respective probabilities 0.5,0.25,0.25. 
One linear output unit was attached and the feedback 
weights W-^'' were drawn from a uniform distribution over 
[—0.56,0.56]. Similarly to 1], a uniform noise term over 
(— 10~^, 10~^) was included in the state-space equation to 
obtain a less correlated echo-regressors and a more stable 
state-space update. 

The training sequence was fed into the network via the 
feedback weights and the internal states were collected into 
the design matrix X. The first 1000 steps were discarded 
as an initial transient resulting in the design matrix of size 
of 2000 X 400. Similarly, the first 1000 points were deleted 
from the training sequence for the regression task. 

The ordinary linear readout was tested first and its 
MSE was calculated. The MSE was found to be relatively 
low which shows that the linear compound of the generated 
echo-regressors could explain the variance of the training 
sequence remarkably well. In fact, the readout gives an 
almost exact fit to the training sequence. This is often 
the case if an ESN setup is strongly autoregressive. In 
the MG example, the training sequence is fed back into 
the reservoir via a relatively large and dense W^^ which 
in turn causes many of the generated echo-regressors to 
strongly correlate with the training sequence (and hence 
the low MSE in the training stage). 



The design matrix was further analyzed using OFR. 
Figure |4] plots the results. It is easy to observe that the 
unexplained variance sharply drops with a selection of one 
single echo-regressor from X. This can be again attributed 
to the strongly autoregressive setup of the ESN which gen- 
erates echo-regressors that are highly correlated with the 
training sequence. This observation indicates that it will 
probably be difficult to generate a signal of similarly high 
correlation using even many RBFs in an RBF readout. 
Apparently, in this task a readout capable of generating 
a flexible response surface will most likely not provide an 
increase in performance. This fact together with the au- 
toregressive ESN setup however suggests that delaying the 
echo-regressors by means of D&S readout may on the other 
hand provide an increase in performance. 

1 1 1 1 1 1 1 1 1 1 
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Figure 4: OFR (first iteration of LROFR) using the generated echo- 
regressors against the response variable. 

Locally regularized linear readout was than tested to 
see how many echo-regressors will be attenuated using reg- 
ularization parameters. The analysis was particularly in- 
teresting because the training sequence is noise-free. Af- 
ter the convergence of the regularization parameters Xi 
(10 iterations), vector A and the corresponding regression 
weights were examined. About 50 echo-regressors had ex- 
tremely large which in turn caused their corresponding 
regression weights to became virtually zero (i.e. extremely 
small). Table [2] shows the training MSE. As expected, the 
locally regularized linear readout gives a slightly higher 
MSE because its response is smoothed. Smoothing was 
, however, gentle and it was not obvious to the naked eye 
when compared to the training sequence or to the response 
of the ordinary linear readout. Output weights in W°"* 
were then re-ordered to its original order so that they can 
be used with the state-space update in the testing stage. 

A locally regularized RBF readout was also tested. The 
results were however unsatisfactory. As we already pointed 
out, this can be attributed to the fact that within the 
generated echo-regressors there are already signals that 
strongly correlate with the response variable. It is difficult 
and meaningless to generate a signal of similarly high cor- 
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relation by adding together even many RBFs (even if each 
RBF node would have a tunable variance parameter). 



Table 2: Mackay-Glass task. Readout type: #1 - linear, #2 - locally 
regularized linear, #3 - locally regularized RBF. 



readout type 


TflSGij-d'lji 


NRMSEs4 


NRMSEi2Q 


#1 


1.44774 X 10-« 


0.127 


0.219 


#2 


1.51756 X 10"** 


0.121 


0.218 


#3 


N/A 


N/A 


N/A 



In the testing stage, 100 independent sequences of length 
1084 were generated. All sequences were transformed in 
the same way as the training sequence {y — ?> tanhijj — 1)) 
to fit them into interval (—1, 1) so that the sequences may 
be used with a reservoir that has internal units with sig- 
moid function. The trained ESN was run teacher-forced 
for 1000 steps and then left running free for 84 steps with 
each of the generated sequences. The difference between 
an output of the freely running ESN and a value of the 
true sequence in step 84 was calculated and averaged over 
all 100 trials by means of normalized root mean square 
error (NRMSE). The NRMSE in step 84 is defined using 
the following formula 



NRMSE' 



84 



100 



IOOct' 



^(y,(fc + 84)-y,(fc + 84))2 



1=1 



where i denotes trial number and cr^ variance of original 
attractor signaQ Definition of NRMSE in other steps (e.g. 
120) is straightforward. 

The two readouts (traditional linear and linear locally 
regularized) were tested and NRMSEsa and NRMSE120 
were calculated. Table [2] gives the results. It was inter- 
esting to observe that despite the fact that the MG series 
is noise-free, a locally regularized readout was found to 
perform better. It is worth to point out that a locally reg- 
ularized linear readout does not give as precise fit to the 
training MG sequence of 2000 points as the plain linear 
readout; i.e. some small fluctuations in the training se- 
quence are smoothed out when using local regularization. 
The results in Table [2] suggest that such minor smoothing 
is beneficial for many (noise-free) sequences in the testing 
state. 



^We followed the work in ^ and its source code to be able to 
compare results. In fact, the length of the original attractor was 
1000 -I- 84 X 100 instead of 1084 X 100. After the initial transient 
period of 1000 points, the internal state of the ESN was saved and 
the net was left running free for 84 steps. The difference between 
the true response and an output of the freely running net in step 
84 was calculated for its use in the NRMSE formula. Then, the net 
was ran now teacher-forced from the saved internal state for 84 steps 
and new internal state was saved. This procedure was repeated 100 
times and NRMSE was computed. 



To further investigate this phenomenon we repeated 
the testing stage ten more times always with new 100 in- 
dependent sequences. The results are presented in Ta- 
ble [3j As we can see, in some cases a locally regular- 
ized linear readout outperformed a plain linear readout, in 
some cases, however, it was the linear readout that per- 
:ormed better. Weights of both readouts were estimated 
using the MG sequence of 2000 points. As it is often the 
:ase, this training sequence does not capture the entire 
dynamics of the underlying system and precise fit (linear 
readout) works well only if testing sequences are similar to 
the training one. For some other sequences, however, the 
smoothed fit (local regularization) works better. This is 
the reason why in Table [3] there are testing stages where 
a locally regularized linear readout outperformed linear 
readout but also stages where it is the other way around. 



Table 3: Comparison of NRMSE^i and NRMSE120 for linear and 
locally regularized readout in ton independent trials. 





NRMSEs4 


NRMSE120 


readout type 


#1 


#2 


#1 


#2 


1 


0.108 


0.114 


0.192 


0.202 


2 


0.143 


0.139 


0.219 


0.211 


3 


0.152 


0.154 


0.273 


0.274 


4 


0.165 


0.155 


0.259 


0.242 


5 


0.158 


0.163 


0.249 


0.270 


6 


0.122 


0.134 


0.154 


0.162 


7 


0.125 


0.123 


0.148 


0.147 


8 


0.142 


0.141 


0.244 


0.232 


9 


0.156 


0.151 


0.246 


0.251 


10 


0.156 


0.150 


0.243 


0.241 



In practice, the length of a training sequence is of- 
ten limited and even noise-free data may benefit smooth- 
ing a model response using local regularization, especially 
systems with a periodic attractor. Furthermore, LROFR 
analysis provides a deeper insight into the dynamics of an 
ESN model. In the presented MG task, the LROFR anal- 
ysis (FigQ shows that there is no need for a more flexible 
readout (e.g. RBF readout) because some of the gener- 
ated echo-regressors already explain a high portion of the 
variance of the training sequence. Testing trials with both 
readouts show that there is probably still room for an im- 
provement of performance but this time not in terms of 
readout flexibility but rather in terms of extracting more 
information from the echo-regressors (e.g. extracting more 
temporal information using delay &:sum readout [3 [H]). 
Some suggestions for research along this line are outlined 
in the next section. 

6. Discussion 

The presented locally regularized readouts improve the 
accuracy, stability and robustness of an ESN model either 
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by direct penalization of the undesired echo-regressors us- 
ing locally regularized linear readout or by an effective 
transformation into RBF regressors using a locally regu- 
larized RBF readout if more flexibility is required. 

LROFR analysis helps to better understand whether 
the state-space parameters are suitable for the data under 
consideration. Different construction mechanisms for the 
weight matrices or activation functions give rise to echo- 
regressors of different shapes and quality. The presented 
analysis using the LROFR enables better understanding 
concerning the efficiency of these mechanisms. Some inter- 
esting examples of such mechanisms for the LROFR anal- 
ysis are as follows. The Hebbian adaptation rule is an in- 
teresting mechanism to optimize the weight values in W*" , 
W jW^^ for ESN models of moderate size (e.g. 100 inter- 
nal units) Other works are concerned with adapting 
the topology of the weight matrices rather than adapting 
a value of each particular weight [H |S] . Imprinting small- 
world and scale- free structures well known from graph the- 
ory onto a structure of the weight matrices yielded partic- 
ularly interesting results [5] . Different activation functions 
f in the state-space equation has also been studied. Be- 
sides traditional sigmoid or identity functions pQ |2] , radial 
basis functions (RBF) [2^, other non-monotonous func- 
tions [7] and parametrized activation functions in f [6 have 
been tested with improved performance and stability of the 
state-space update being observed. The LROFR analysis 
is capable of providing profound and detailed insight into 
these construction mechanisms; regarding their efhciency 
and suitability for a task at hand. 

Echo-regressors delayed by the D&S readout may also 
be analyzed in detail by the LROFR. All the delayed re- 
gressors may be analyzed so that the most significant de- 
lays are determined. The importance of the delays may be 
then used for complex analysis of the temporal structure 
of the underlying problem. The analysis will also show 
whether an additional flexible readout (e.g. RBF) should 
be used after the D&S readout so that robustness and ac- 
curacy is enhanced even further. 

There is a room for improvements in the presented 
readouts too. LROFR may be effectively combined with 
the PRESS statistics (leave-one-out criterion) so that the 
locally regularized linear readout automatically selects a 
significant subset of echo-regressors with no user-specified 
stopping rule being involved [21\. This would automat- 
ically determine dimensionality for the readout which is 
particularly desirable when augmenting internal states. An- 
other interesting alternative is to use the recently proposed 
Coordinate descent approach for finding significant subset 
of echo-regressors [22: . 

Generalization abilities and parsimony of locally regu- 
larized RBF readout may be also further improved using 
tunable RBF kernels [23] . The outlined research directions 
are planned to be pursued in the future. 



7. Conclusion 

This study has shown that the presented modeling strat- 
egy enables better understanding, design and evaluation of 
ESN models. Both presented readouts improve the gener- 
alization ability of an ESN and are viable alternatives to 
traditional linear readout or nonlinear readouts based on 
FFNN/RBF-RVM. Improved performance is not the only 
advantage of the presented approach. The importance of 
echo-regressors generated via the ESN state-space equa- 
tion can be transparently analyzed using the presented 
strategy. Suitability of the state-space parameters and 
overall ESN structure for a task at hand can therefore be 
inspected in a straightforward manner. 

Future work will be targeted towards the improvements 
of our strategy which are outlined in Section [6] The pro- 
posed improvements are likely to further advance the de- 
sign of ESN models producing stable and parsimonious 
ESN models that generalize well. 
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